In this document, different performance measures for prediction based on machine learning models are discussed. First, attention is given to some global performance measures. These measures are useful when comparing between different models or when tuning the parameters within a model, but they cannot be used for explaining why a model made a specific prediction. Therefore, we also look at local perfromance measures, such as Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP).
When explaining the different concepts, we will also show an application based on the Concrete Compressive Strength Data Set, which is experimental data from UCI (University of California machine learning repository) and discussed in Yeh (2018).
library(readxl)
library(DT)
dat = read_excel("Concrete_Data.xls",sheet=1)
datatable(dat)
names(dat) = c("Cement","BFS","Ash","Water","Plast","CA","FA","Age","CCS")
datatable(dat)
x_var = names(dat)[1:8]
The dataset consists of a continuous response, being the Concrete compressive strength, and 8 input features, which are all quantitative. The response looks as follows
boxplot(dat$CCS,horizontal=TRUE)
For example purposes, we also add a binary response variable, defined as 1 when CCS>35 and 0 otherwise, which leads to a balanced outcome:
dat_Bin = dat
dat_Bin$Indicator = (dat$CCS>35)*1
dat_Bin$Indicator =as.factor(c("Low","High")[dat_Bin$Indicator+1])
table(dat_Bin$Indicator)
##
## High Low
## 507 523
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
featurePlot(x = dat_Bin[, 1:8],
y = as.factor(dat_Bin$Indicator),
plot = "pairs",
## Add a key at the top
auto.key = list(columns = 3))
## Warning in draw.key(simpleKey(...), draw = FALSE): not enough rows for columns
When using machine learning methodology, the original dataset is split into two parts: a bigger part (e.g. 80% of the original data) for training the models (training set) and a smaller part (e.g. the remaining 20%) for assessing the model performance (test set). In addition, when the model parameters need to be tuned, techniques like cross-validation can be used to further split the training set to also obtain validation sets which can be used for parameter optimization. A 5-fold cross-validation approach is graphically depicted in the Figure below. In split 1, the model is trained on folds 2-4 and validated on fold 1 (which is the left-out fold in this case). The same procedure is repeated for splits 2-5 and an averaged performance metric can be calculated. The parameter value corresponding to the best averaged performance metric is selected as the optimal choice for the model under investigation.
When
calculating the global performance measures, \(N\) in the formulas below refers to either
the number of observations in the test set or to the number of
observations in the left-out fold, depending on whether the user wishes
to compare between models or determine the optimal parameters within a
model, respectively.
dat_scaled <- data.frame(scale(dat))
dat_scaled$Indicator = dat_Bin$Indicator
set.seed(3456)
trainIndex <- createDataPartition(dat_scaled$Indicator, p = .8,
list = FALSE,
times = 1)
datTrain_scaled <- dat_scaled[ trainIndex,]
datTest_scaled <- dat_scaled[-trainIndex,]
datTrain <- dat_Bin[ trainIndex,c(1:9)]
datTest <- dat_Bin[-trainIndex,c(1:9)]
datTrain_bin <- dat_Bin[ trainIndex,c(1:8,10)]
datTest_bin <- dat_Bin[-trainIndex,c(1:8,10)]
datatable(datTrain)
datatable(datTest)
For continuous outcomes, the three most common performance metrics are Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Below, these metrics are further detailed upon and are applied using a 1) linear regression model and 2) a gradient boosting machine fitted on the training dataset introduced above:
mod_lin<- train(CCS ~ ., data = datTrain,
method = "lm")
pred = predict(mod_lin, datTest)
plot(pred,datTest$CCS)
mod_gbm = train(CCS ~ ., data = datTrain,
method = "gbm",verbose=FALSE)
pred_gbm = predict(mod_gbm, datTest)
plot(pred,datTest$CCS)
points(pred_gbm,datTest$CCS,col="red")
abline(a=0,b=1,lwd=2,col="blue")
mod_log = train(Indicator ~ ., data = datTrain_bin,
method = "glm",
family = "binomial")
pred_bin = predict(mod_log, datTest_bin)
ctrl <- trainControl(method = "cv",number=5)
set.seed(1)
mod_gbm_bin <- train(Indicator ~ ., data = datTrain_bin,
method = "gbm",
verbose = FALSE,
trControl = ctrl)
pred_gbm_bin <- predict(mod_gbm_bin, newdata=datTest_bin)
From the graph above, it is observed that the results of the linear model (shown in black) seem to be less in line with the true observations as compared to the predictions based on the GBM. In addition, it should however be noted that no parameter tuning for the GBM is performed and hence this model could still be improved. For example purposes, this latter step was not included. Based on the global performance measures below, a more formal comparison between the models will be made.
The Mean Absolute Error is the average of the difference between the ground truth and the predicted values. Mathematically, its represented as : \[ MAE = \frac{1}{N} \sum_{i=1}^N{\left|y_i - \hat{y_i}\right|} \]
where
Applied to the CCS data, we obtain:
MAE = mean(abs(datTest$CCS - pred))
print(MAE)
## [1] 8.425781
MAE_gbm = mean(abs(datTest$CCS - pred_gbm))
print(MAE_gbm)
## [1] 4.1589
which means that our linear regression model makes an average absolute error of 8.4258MPa as compared to an average absolute error of 4.1589MPa for the GBM model.
The mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model: \[ MSE = \frac{1}{N} \sum_{i=1}^N{(y_i - \hat{y_i})^2} \] It should be noted that the error interpretation has to be done with squaring factor in mind. This is remedied by looking at the square root of the MSE, denoted by the root mean squared error, abbreviated by RMSE: \[ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^N{(y_i - \hat{y_i})^2} } \]
Applied to the CCS data, we obtain:
MSE = mean((datTest$CCS - pred)^2)
print(MSE)
## [1] 112.5587
RMSE = sqrt(MSE)
print(RMSE)
## [1] 10.60937
MSE_gbm = mean((datTest$CCS - pred_gbm)^2)
print(MSE_gbm)
## [1] 31.61108
RMSE_gbm = sqrt(MSE_gbm)
print(RMSE_gbm)
## [1] 5.622373
which means that our linear regression model makes an average squared error of 112.5587MPa squared as compared to an average squared error of 31.6111MPa squared for the GBM predictions. Or equivalently, the average error is 10.6094 MPa for the linear model and 5.6224MPa for the GBM. One can conclude that the GBM is most appropriate when the aim is to predict new observations.
For binary (or more general categorical) outcomes, one often makes a confusion matrix, from where different metrics can be derived. Some of them include:
which are explained on the Figure below.
In addition, we also have
An application is found below, where the performance of the logistic regression model on the GBM model fitted on the CCS data with the binary indicator outcome is compared.
confusionMatrix(pred_bin,datTest_bin$Indicator, mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 85 11
## Low 16 93
##
## Accuracy : 0.8683
## 95% CI : (0.8142, 0.9114)
## No Information Rate : 0.5073
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7363
##
## Mcnemar's Test P-Value : 0.4414
##
## Sensitivity : 0.8416
## Specificity : 0.8942
## Pos Pred Value : 0.8854
## Neg Pred Value : 0.8532
## Precision : 0.8854
## Recall : 0.8416
## F1 : 0.8629
## Prevalence : 0.4927
## Detection Rate : 0.4146
## Detection Prevalence : 0.4683
## Balanced Accuracy : 0.8679
##
## 'Positive' Class : High
##
confusionMatrix(pred_gbm_bin,datTest_bin$Indicator, mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 93 10
## Low 8 94
##
## Accuracy : 0.9122
## 95% CI : (0.8648, 0.9471)
## No Information Rate : 0.5073
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8244
##
## Mcnemar's Test P-Value : 0.8137
##
## Sensitivity : 0.9208
## Specificity : 0.9038
## Pos Pred Value : 0.9029
## Neg Pred Value : 0.9216
## Precision : 0.9029
## Recall : 0.9208
## F1 : 0.9118
## Prevalence : 0.4927
## Detection Rate : 0.4537
## Detection Prevalence : 0.5024
## Balanced Accuracy : 0.9123
##
## 'Positive' Class : High
##
The PIP can be used for both binary and continuous outcomes as it quantifies how often one model outperforms the other when predicting new individual observations. Below, the repCV5 PIP is considered:
PIP = mean((datTest$CCS - pred_gbm)^2 < (datTest$CCS - pred)^2)
print(PIP)
## [1] 0.7463415
set.seed(1988)
reps=5
K=5
PIP_cv=c()
for(l in 1:reps){
dat=datTrain
yourData<-dat[sample(nrow(dat)),]
cvIndex <- createFolds(dat$CCS, K, returnTrain = T)
#Perform K fold cross validation
pip_cv = c()
for(j in names(cvIndex)){
trainData = yourData[cvIndex[[j]],]
testData = yourData[-cvIndex[[j]],]
x_train <- trainData[, 1:8]
y_train <- trainData[, 9]
x_test <- testData[, 1:8]
y_test <- testData[, 9]
#pars = mod_gbm$bestTune
# gbmGrid <- expand.grid(interaction.depth = pars$interaction.depth,
# n.trees = pars$n.trees,
# shrinkage = pars$shrinkage,
# n.minobsinnode = pars$n.minobsinnode)
mod1= train(CCS ~ ., data = trainData,
method = "gbm",verbose=FALSE)
mod0<- train(CCS ~ ., data = trainData,
method = "lm")
pred0 = predict(mod0, testData)
pred1 = predict(mod1, testData)
pip_cv = c(pip_cv,mean((pred1-y_test)^2 < (pred0-y_test)^2) + 0.5*mean((pred1-y_test)^2 == (pred0-y_test)^2) )
}
PIP_cv = c(PIP_cv,mean(pip_cv))
}
print(mean(PIP_cv))
## [1] 0.7617274
PIP = mean(((datTest_bin$Indicator=="Low")*1 - (pred_gbm_bin=="Low")*1)^2 < ((datTest_bin$Indicator=="Low")*1 - (pred_bin=="Low")*1)^2)+0.5*mean(((datTest_bin$Indicator=="Low")*1 - (pred_gbm_bin=="Low")*1)^2 == ((datTest_bin$Indicator=="Low")*1 - (pred_bin=="Low")*1)^2)
print(PIP)
## [1] 0.5219512
set.seed(1988)
reps=5
K=5
PIP_cv=c()
for(l in 1:reps){
dat=datTrain_bin
yourData<-dat[sample(nrow(dat)),]
cvIndex <- createFolds(dat$Indicator, K, returnTrain = T)
#Perform K fold cross validation
pip_cv = c()
for(j in names(cvIndex)){
trainData = yourData[cvIndex[[j]],]
testData = yourData[-cvIndex[[j]],]
x_train <- trainData[, 1:8]
y_train <- trainData[, 9]
x_test <- testData[, 1:8]
y_test <- testData[, 9]=="Low"
#pars = mod_gbm$bestTune
# gbmGrid <- expand.grid(interaction.depth = pars$interaction.depth,
# n.trees = pars$n.trees,
# shrinkage = pars$shrinkage,
# n.minobsinnode = pars$n.minobsinnode)
mod0 = train(Indicator ~ ., data = trainData,
method = "glm",
family = "binomial")
pred0 = predict(mod0, testData)=="Low"
ctrl <- trainControl(method = "cv",number=5)
set.seed(1)
mod1 <- train(Indicator ~ ., data = trainData,
method = "gbm",
verbose = FALSE,
trControl = ctrl)
pred1 <- predict(mod1, newdata=testData)=="Low"
pip_cv = c(pip_cv,mean((pred1-y_test)^2 < (pred0-y_test)^2) + 0.5*mean((pred1-y_test)^2 == (pred0-y_test)^2) )
}
PIP_cv = c(PIP_cv,mean(pip_cv))
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(mean(PIP_cv))
## [1] 0.5219198
pred_gbm_bin_raw <- predict(mod_gbm_bin, newdata=datTest_bin,type="prob")[,2]
pred_log_bin_raw <- predict(mod_log, newdata=datTest_bin,type="prob")[,2]
PIP_raw = mean(((datTest_bin$Indicator=="Low")*1 - pred_gbm_bin_raw)^2 < ((datTest_bin$Indicator=="Low")*1 - pred_log_bin_raw)^2)+0.5*mean(((datTest_bin$Indicator=="Low")*1 - pred_gbm_bin_raw)^2 == ((datTest_bin$Indicator=="Low")*1 - pred_log_bin_raw)^2)
print(PIP_raw)
## [1] 0.8146341
From here, we see that while the binary predictions based on logistic regression and GBM are pretty close (PIP = 0.522), the underlying raw probabilities are targeted more towards the extremes for GBM (PIP = 0.8146)
Above, it was seen that the GBM often outperforms standard linear or logistic regression models when it comes down to predictive accuracy. One of the mentioned disadvantages for the GBM is that it is less interpretable, in the sense that one cannot tell directly how specific covariates influence the prediction. Nevertheless, this disadvantage is is easily addressed with various tools such as variable importance, partial dependence plots, LIME, SHAP,…).
library(pdp)
library(ggplot2)
for(x in x_var){
par1 <- partial(mod_gbm, pred.var = x, chull = TRUE)
plot1 <- autoplot(par1, contour = TRUE)
print(plot1)
}
# Two Variables
par2 <- partial(mod_gbm, pred.var = x_var[1:2], chull = TRUE)
plot2 <- autoplot(par2, contour = TRUE,
legend.title = "Partial\ndependence")
print(plot2)
I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).